<HTML>
<HEAD>
<TITLE>orderchr - determine a chromosome order to minimize cross-over of links</TITLE>
<LINK REV="made" HREF="mailto:root@gene0.cigenomics.bc.ca">
</HEAD>

<BODY>

<!-- INDEX BEGIN -->

<UL>

	<LI><A HREF="#NAME">NAME</A>
	<LI><A HREF="#SYNOPSIS">SYNOPSIS</A>
	<LI><A HREF="#READ_FIRST">READ FIRST</A>
	<LI><A HREF="#DESCRIPTION">DESCRIPTION</A>
	<LI><A HREF="#DEFINING_WHICH_CHROMOSOMES_TO_SH">DEFINING WHICH CHROMOSOMES TO SHUFFLE</A>
	<UL>

		<LI><A HREF="#MODE_1_shuffle_set_defined_by_">MODE 1 - shuffle set defined by link data</A>
		<LI><A HREF="#MODE_2_shuffle_set_defined_by_">MODE 2 - shuffle set defined by file</A>
		<LI><A HREF="#MODE_3_shuffle_set_defined_by_">MODE 3 - shuffle set defined by regular expression</A>
		<LI><A HREF="#MODE_4_shuffle_set_defined_by_">MODE 4 - shuffle set defined by list</A>
		<LI><A HREF="#MODE_5_multimode">MODE 5 - multimode</A>
	</UL>

	<LI><A HREF="#DEFINING_INITIAL_ORDER">DEFINING INITIAL ORDER</A>
	<LI><A HREF="#DEFINING_WHICH_CHROMOSOMES_TO_RE">DEFINING WHICH CHROMOSOMES TO REMAIN ANCHORED</A>
	<LI><A HREF="#CONTROLING_THE_OPTIMIZATION">CONTROLING THE OPTIMIZATION</A>
	<UL>

		<LI><A HREF="#Warmup_Round">Warmup Round</A>
		<LI><A HREF="#Stochastic_Optimization_Round">Stochastic Optimization Round</A>
	</UL>

	<LI><A HREF="#SIMULATED_ANNEALING">SIMULATED ANNEALING</A>
	<UL>

		<LI><A HREF="#iterations">iterations</A>
		<LI><A HREF="#max_flips_min_flips">max_flips, min_flips</A>
		<LI><A HREF="#temp0">temp0</A>
		<LI><A HREF="#optimize_minimize_maximize">optimize = minimize|maximize</A>
	</UL>

	<LI><A HREF="#SIMULATION_SCHEDULE">SIMULATION SCHEDULE</A>
	<UL>

		<LI><A HREF="#No_Warmup_Single_Stochastic_Rou">No Warmup, Single Stochastic Round</A>
		<LI><A HREF="#No_Warmup_Multiple_Stochastic_R">No Warmup, Multiple Stochastic Rounds</A>
		<LI><A HREF="#Warmup_Multiple_Stochastic_Roun">Warmup, Multiple Stochastic Rounds</A>
	</UL>

	<LI><A HREF="#HISTORY">HISTORY</A>
	<LI><A HREF="#BUGS">BUGS</A>
	<LI><A HREF="#AUTHOR">AUTHOR</A>
	<LI><A HREF="#CONTACT">CONTACT</A>
</UL>
<!-- INDEX END -->

<HR>
<P>
<HR>
<H1><A NAME="NAME">NAME</A></H1>
<P>
orderchr - determine a chromosome order to minimize cross-over of links

<P>
<HR>
<H1><A NAME="SYNOPSIS">SYNOPSIS</A></H1>
<P>
<PRE>  orderchr -links linkfile.txt 
           -karyotype karyotype.txt 
           { -shuffle_file chrs_to_shuffle.txt | -shuffle LIST | -shuffle_rx REGEX_LIST } 
           {-static LIST} {-static_rx REGEX_LIST}
           {-init_order LIST} {-init_order_rx REGEX_LIST}
</PRE>
<P>
<HR>
<H1><A NAME="READ_FIRST">READ FIRST</A></H1>
<P>
This ordering script is most suitable for dense link data (many-to-many)
(i.e. significant fraction of all chromosome pairs have links). If your
data is sparse (one-to-many) (i.e. many chromosomes receive links from only
one chromosome), please try tools/orderchr-deterministic.

<P>
<HR>
<H1><A NAME="DESCRIPTION">DESCRIPTION</A></H1>
<P>
By examining the frequencies of chromosome-chromosome relationships defined
in the link file, this script suggests a new order for chromosomes that
results in fewer cross-overs between links. Simulated annealing is used to
optimize the chromosome order (read below about parameters). Run the
simulation a couple of times to check convergence (finding a global minimum
is not guaranteed).

<P>
<HR>
<H1><A NAME="DEFINING_WHICH_CHROMOSOMES_TO_SH">DEFINING WHICH CHROMOSOMES TO SHUFFLE</A></H1>
<P>
The set of chromosomes to shuffle is specified by either (a) link data,
whereby all chromosomes that have links are subject to shuffling (b)
-shuffle_file, whereby only those chromosomes listed in the file are
shuffled (c) -shuffle, whereby the set of chromosomes to shuffle is
specified by a list, and (d) -shuffle_rx, whereby only those chromosomes
that match a regular expression are shuffled. Any chromosome that is not
identified by one of these methods does not participate in the shuffling
process. Specifically, any links to/from such chromosomes are not
considered when minimizing link crossing.

<P>
<HR>
<H2><A NAME="MODE_1_shuffle_set_defined_by_">MODE 1 - shuffle set defined by link data</A></H2>
<P>
<PRE>  &gt; orderchr -links linkfile.txt
</PRE>
<P>
All chromosomes mentioned in the -links file will be subject to reordering.
The initial order will be taken from order of appearance in the karyotype
file.

<P>
<HR>
<H2><A NAME="MODE_2_shuffle_set_defined_by_">MODE 2 - shuffle set defined by file</A></H2>
<P>
<PRE>  &gt; orderchr -links linkfile.txt -shuffle_file chrs_to_shuffle.txt
</PRE>
<P>
The set of chromosomes to shuffle is given in the -shuffle_file file, which
contains one chromosome per line. For example

<P>
<PRE>  &gt; cat chrs_to_shuffle.txt
  chr1
  chr5
  chr12
  chr17
  ...
</PRE>
<P>
The initial order will be taken from order of appearance in the file.

<P>
<HR>
<H2><A NAME="MODE_3_shuffle_set_defined_by_">MODE 3 - shuffle set defined by regular expression</A></H2>
<P>
<PRE>  &gt; orderchr -links linkfile.txt -shuffle_rx chr1
</PRE>
<P>
Same as MODE 1, except that chromosome list will be filtered using the
regular expression and only those chromosomes that match the regular
expression are shuffled.

<P>
In this example, chromosomes matching ``chr1'' will be shuffled (e.g. chr1,
chr10, chr11, etc).

<P>
<HR>
<H2><A NAME="MODE_4_shuffle_set_defined_by_">MODE 4 - shuffle set defined by list</A></H2>
<P>
<PRE>  &gt; orderchr -links linkfile.txt -shuffle chr1,chr2,chr6,chr7,chr10
</PRE>
<P>
Same as MODE 3, except that chromosomes are specified by a list.

<P>
In this example, chromosomes chr1,chr2,chr6,chr7, and chr10 will be
shuffled.

<P>
<HR>
<H2><A NAME="MODE_5_multimode">MODE 5 - multimode</A></H2>
<P>
You can combine -shuffle_file, -shuffle_rx and -shuffle to additively
define the shuffle list.

<P>
<HR>
<H1><A NAME="DEFINING_INITIAL_ORDER">DEFINING INITIAL ORDER</A></H1>
<P>
The initial order of chromosomes can be defined in two ways. First, the
method that is used to specify which chromosomes to shuffle will dictate
the initial order. Modes 1 and 2 (see above) use the order of chromosomes
as they appear in the karyotype file. Mode 3 (see above) uses the order
from the shuffle file.

<P>
You can override the initial order using the -init_order parameter. The
value of this parameter is expected to a comma-delimited list of
chromosomes, which may be the full set or a subset of chromosomes.

<P>
For example, if the entire set of chromosomes to shuffle is chr1..chr5,
then you can specify the initial order which explicitly orders each
chromosome

<P>
<PRE>  -init_order chr2,chr5,chr1,chr3,chr4
</PRE>
<P>
or just a subset

<P>
<PRE>  -init_order chr2,chr5
</PRE>
<P>
In the latter case, the final order will be

<P>
<PRE>  { chr2,chr5 } , { chr1,chr3,chr4 }
</PRE>
<P>
comprised of two order groups: leading group of chromosomes as ordered by
-init_order and a group of remaining chromosomes, in order of appearance as
set by parameters in the section DEFINING WHICH CHROMOSOMES TO SHUFFLE.

<P>
If a chromosome mentioned in -init_order is not a candidate for shuffling,
its mention in the order string will be ignored.

<P>
The option -init_order_rx works just like -init_order, except that the list
a list of regular expressions rather than chromsome names. For example,

<P>
<PRE>  -init_order chr1,chr2
</PRE>
<P>
is equivalent to

<P>
<PRE>  -init_order { chrs matching /chr1/ },{ chrs matching /chr2/ }
</PRE>
<P>
and for the canonical human genome with standard order this would be

<P>
<PRE>  -init_order chr1,chr10,chr11,chr2,chr20,chr21,chr22
</PRE>
<P>
Since this is a subset of chromosomes, the final initial order will be
automatically completed by chromosomes from the karyotype file that were
not explicitly ordered

<P>
<PRE>  chr1, chr10, chr11, chr2, chr20, chr21, chr22, chr3..chr9, chr12..chr19, chrX, chrY
</PRE>
<P>
If both -init_order_rx and -init_order are defined, order is initially
defined by -init_order_rx and then refined using -init_order. Thus

<P>
<PRE>  -init_order chr10,chr20,chrx -init_order_rx chr1,chr2
</PRE>
<P>
will result in

<P>
<PRE>  chr10, chr20, chrx, chr1, chr11, chr2, chr21, chr22, chr3..chr9, chr12..chr19 ,chrY
</PRE>
<P>
<HR>
<H1><A NAME="DEFINING_WHICH_CHROMOSOMES_TO_RE">DEFINING WHICH CHROMOSOMES TO REMAIN ANCHORED</A></H1>
<P>
After the set of chromosomes to shuffle has been defined, and the initial
order has been set, you can define a subset of chromosomes to remain in the
same order (static) throughout the shuffling process. 

<P>
The difference between a chromosome (a) not being part of a shuffle set and
(b) being part of a shuffle set, but remain static, is that in the former,
links to chromosomes do not play a role in the ordering process whereas in
the latter case links to these chromosomes contribute to the shuffle score.
Thus, chromosomes which are static have all non-static chromosomes shuffled
around them in order to minimize link crossover.

<P>
Defining static chromosomes is done by a comma-delimited list of regular
expressions

<P>
<PRE>  &gt; orderchr -links linkfile.txt -static_rx chr1
</PRE>
<P>
In this example, all chromosomes matching the regular expression chr1 will
not have their order adjusted. Any links to/from these chromosomes will
contribute to the total link crossing score, but the chromosomes themselves
will not be moved. For example, if the original order of chromosomes is

<P>
<PRE>  chr1,chr2,chr3,chr10,chr11,chr20,chr21
</PRE>
<P>
then any shuffle solution will have the order

<P>
<PRE>  chr1,-,-,chr10,chr11,-,-
</PRE>
<P>
with chr1, chr10 and chr11 remaining fixed.

<P>
To define multiple regular expressions, use a list of regular expressions.

<P>
<PRE>  &gt; orderchr -links linkfile.txt -static_rx chr1,x,y
</PRE>
<P>
Like with -init_order, you can use the chromosome names to define static
entries using -static.

<P>
<PRE>  -static chr1,chr2,chr3
</PRE>
<P>
will keep chromosomes chr1, chr2 and chr3 always in the same position. You
can combine -static_rx and -static

<P>
<PRE>  -static_rx chr1,chr2 -static chrx,chry
</PRE>
<P>
in which case all chromosomes that match either the regular expressions
defined by -static_rx or the names defined by -static will be kept in the
same position during shuffling.

<P>
<HR>
<H1><A NAME="CONTROLING_THE_OPTIMIZATION">CONTROLING THE OPTIMIZATION</A></H1>
<P>
The order optimization process comprises one or more rounds. Each round is
defined by a &lt;round&gt; block in the &lt;simulation&gt; block

<P>
<PRE>  &lt;simulation&gt;
   &lt;round&gt;
    # settings for round 1
   &lt;/round&gt;
   &lt;round&gt;
    # settings for round 2
   &lt;/round&gt;
   ...
  &lt;/simulation&gt;
</PRE>
<P>
A round can be either a warmup (read below), or a full simulated annealing
process (read below). The outcome of the warmup is deterministic, and thus
the warmup should only be used as the first (optional) round.

<P>
<HR>
<H2><A NAME="Warmup_Round">Warmup Round</A></H2>
<P>
During the warmup round, the initial order of the chromosomes is defined
based on the degree of connectivity between chromosome pairs. 

<P>
This warmup is most suited for data sets in which most relationships
between chromosomes are many-to-one (e.g., chromosome A has links to many
chromosomes B,C,D,... but each of B,C,D generally only links to A).
Many-to-one data sets are common for alignments (e.g. chr A corresponds to
the chromosome whereas and points to ctg A, B, C, ... all sequence contigs
that map to disjoint regions on chr A).

<P>
The warmup algorithm is as follows. The chromosome (chrA) with the most
links is selected first and used to initialize the new order. A list of all
links to chrA is created, grouped by chromosome, and sorted based on the
average position of the link on chrA. Chromosomes are added to the new
order based on descending order of grouped link position. Once all
chromosomes are placed, the next unplaced chromosome with the most links is
selected and the process continues until all chromosomes are placed.

<P>
The warmup is deterministic - it will result in the same order each time.
It is insensitive to the initial order, or values of -static and
-static_rx.

<P>
<PRE>  &lt;round&gt;
   warmup = yes
  &lt;/round&gt;
</PRE>
<P>
<HR>
<H2><A NAME="Stochastic_Optimization_Round">Stochastic Optimization Round</A></H2>
<P>
After the optional warmup round, all other rounds should be of stochastic
type (this is the default round type, if warmup=yes is not set).

<P>
Parameters for the round are defined as follows

<P>
<PRE>  &lt;round&gt;
    iterations = 1000
    max_flips  = 10
    min_flips  = 2
    temp0      = 0.01
  &lt;round&gt;
</PRE>
<P>
For the details of each parameter, read the section below. You can set
parameter values to be relative to values of the previous round by
prefixing the parameter with ``r''. For example,

<P>
<PRE>  &lt;round&gt;
    iterations = 1000
    ...
    temp0      = 0.01
  &lt;round&gt;
</PRE>
<P>
<PRE>  &lt;round&gt;
    iterations = r2
    ...
    temp0      = r0.5
  &lt;round&gt;
</PRE>
<P>
<PRE>  &lt;round&gt;
    iterations = r2
    ...
    temp0      = r0.5
  &lt;round&gt;
</PRE>
<P>
defines three rounds. The first round has 1000 iterations with temp0=0.01.
The second round has 2x iterations (2000) and a value of temp0 of
0.5*0.01=0.005. The third round has again 2x iterations (4000) and temp0 of
0.5*0.005=0.0025.

<P>
Relative parameter values are very useful for additional rounds when the
transition probability is decreased (temp0 is lowered). You can decrease
temp0 in relative steps, without needing to remember what the previous
value was. This allows you to create a multi-round optimization schedule
with all parameter defined in a single place (first round).

<P>
The solution at the end of a round is used as the initial order for the
next round.

<P>
<HR>
<H1><A NAME="SIMULATED_ANNEALING">SIMULATED ANNEALING</A></H1>
<P>
This method is an optimization method that encourages the discovery of a
global minimum by traversing the space of solutions with a small (and
decreasing as simulation runs) chance of visiting less desirable solutions. 

<P>
There are three parameters that control the optimization.

<P>
<HR>
<H2><A NAME="iterations">iterations</A></H2>
<P>
The number of iterations to perform. At each iteration, the current
solution is randomly modified and either accepted or rejected.

<P>
<HR>
<H2><A NAME="max_flips_min_flips">max_flips, min_flips</A></H2>
<P>
The optimization run is split into max_flips-min_flips+1 equal-sized
intervals. During each iterval, the number of random chromosome pair swaps
in the solution is given by

<P>
<PRE>  min_flips + (max_flips-min_flips)*(1-t)
</PRE>
<P>
where t is a relative round completion time t=0..1 at the current
iteration.

<P>
For example, if max_flips is 5 and min_flips is 2 and iterations=1000. Then
the number of random pair swaps is

<P>
<PRE>  iteration 1-249    5
  iteration 250-499  4
  iteration 500-749  3
  iteration 750-1000 2
</PRE>
<P>
I suggest starting with a value that corresponds to 5% of the chromosomes.
For example, if you have 100 chromosomes, use max_flips=5 to start. It's
also a good idea to set min_flips=1 for the last round to avoid abandoning
the solution (remember that in simulated annealing it is possible to
discard a solution for a worse solution).

<P>
<HR>
<H2><A NAME="temp0">temp0</A></H2>
<P>
This parameter determines the probability of a transition to a less
desirable solution. The transition probability is

<P>
<PRE>  p(dE) = temp0*exp( - dE/t )
</PRE>
<P>
where t=1..0 over the length of the simulation and dE is the relative
change in the desirability of two solutions. 

<P>
If temp0=1, then the probability of accepting a solution that is 10% worse
(e.g. dE=0.1) is

<P>
<PRE>  p(0.1) = exp (-0.1/1)   = 90%    at start of simulation
         = exp (-0.1/0.5) = 82%    half way through simulation
         = exp (-0.1/0.1) = 37%    90% of the way through simulation
</PRE>
<P>
By lowering temp0, you lower the probability of transition to a less
desirable solution. 

<P>
Do not adjust temp0 unless you feel that the simulation is (a) not
traversing the solution space sufficiently - in which case make temp0
larger or (b) too many low-quality solutions are accepted - in which case
make temp0 smaller.

<P>
<HR>
<H2><A NAME="optimize_minimize_maximize">optimize = minimize|maximize</A></H2>
<P>
Most of the time you'll want to adjust the chromosome order in a way to
minimize the number of crossing links. However, you can set to maximize the
number of crossing links by setting

<P>
<PRE>  optimize = maximize
</PRE>
<P>
<HR>
<H1><A NAME="SIMULATION_SCHEDULE">SIMULATION SCHEDULE</A></H1>
<P>
<HR>
<H2><A NAME="No_Warmup_Single_Stochastic_Rou">No Warmup, Single Stochastic Round</A></H2>
<P>
<PRE>  &lt;simulation&gt;
  &lt;round&gt;
   iterations = 1000
   max_flips  = 10 # or set this to ~5% of your chromosomes
   min_flips  = 1
   temp0      = 0.01
  &lt;/round&gt;
  &lt;/simulation&gt;
</PRE>
<P>
<HR>
<H2><A NAME="No_Warmup_Multiple_Stochastic_R">No Warmup, Multiple Stochastic Rounds</A></H2>
<P>
The purpose of rounds 2 and 3 is to successively decrease the transition
probability to worse solutions and also decrease the degree to which
successive candidate solutions vary from the current solution. In these
rounds, a more careful search is carried out around the solution provided
in round 1.

<P>
<PRE>  &lt;simulation&gt;
  &lt;round&gt;
   iterations = 1000
   max_flips  = 10 # or set this to ~5% of your chromosomes
   min_flips  = 1
   temp0      = 0.01
  &lt;/round&gt;
  &lt;/simulation&gt;
  &lt;simulation&gt;
  &lt;round&gt;
   iterations = r2    # 2000
   max_flips  = 2
   min_flips  = 1
   temp0      = r0.5  # 0.005
  &lt;/round&gt;
  &lt;round&gt;
   iterations = r2    # 4000
   max_flips  = 1
   min_flips  = 1
   temp0      = r0.1 # 0.0005
  &lt;/round&gt;
  &lt;/simulation&gt;
</PRE>
<P>
<HR>
<H2><A NAME="Warmup_Multiple_Stochastic_Roun">Warmup, Multiple Stochastic Rounds</A></H2>
<P>
<PRE>  &lt;simulation&gt;
  &lt;round&gt;
   warmup = yes
  &lt;/round&gt;
  &lt;round&gt;
   iterations = 1000
   max_flips  = 10 # or set this to ~5% of your chromosomes
   min_flips  = 1
   temp0      = 0.01
  &lt;/round&gt;
  &lt;/simulation&gt;
  &lt;simulation&gt;
  &lt;round&gt;
   iterations = r2    # 2000
   max_flips  = 2
   min_flips  = 1
   temp0      = r0.5  # 0.005
  &lt;/round&gt;
  &lt;round&gt;
   iterations = r2    # 4000
   max_flips  = 1
   min_flips  = 1
   temp0      = r0.1 # 0.0005
  &lt;/round&gt;
  &lt;/simulation&gt;
</PRE>
<P>
<HR>
<H1><A NAME="HISTORY">HISTORY</A></H1>
<UL>
<LI><STRONG><A NAME="item_14">14 July 2008</A></STRONG>
<P>
Expanded documentation and added _rx parameters.

<LI><STRONG><A NAME="item_8">8 July 2008</A></STRONG>
<P>
Started and versioned.

</UL>
<P>
<HR>
<H1><A NAME="BUGS">BUGS</A></H1>
<P>
<HR>
<H1><A NAME="AUTHOR">AUTHOR</A></H1>
<P>
Martin Krzywinski

<P>
<HR>
<H1><A NAME="CONTACT">CONTACT</A></H1>
<P>
<PRE>  Martin Krzywinski
  Genome Sciences Centre
  Vancouver BC Canada
  www.bcgsc.ca
  martink@bcgsc.ca
</PRE>
</BODY>

</HTML>
